AITopics | domain-specific data

domain-specific data

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

How Well Do LLMs Predict Human Behavior? A Measure of their Pretrained Knowledge

Gao, Wayne, Han, Sukjin, Liang, Annie

arXiv.org Machine Learning, Jan-21-2026

Large language models (LLMs) are increasingly used in economics as predictive tools, both to generate synthetic responses in place of human subjects (Horton, 2023; Anthis et al., 2025) and to forecast economic outcomes directly (Hewitt et al., 2024a; Faria-e Castro and Leibovici, 2024; Chan-Lau et al., 2025). Their appeal in these roles is obvious: A pretrained LLM embeds a vast amount of information and can be deployed at negligible cost, often in settings where collecting new, domain-specific human data would be expensive or infeasible. What remains unclear is how to assess the quality of these predictions. This paper proposes a measure that quantifies the domain-specific value of LLMs in an interpretable unit: the amount of human data they substitute for. Specifically, we ask how much human data would be required for a conventional model trained on that data to match the predictive performance of the pretrained LLM in that domain.
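
One way to read the measure described above: trace a conventional model's learning curve (test error against the number of human observations used for training) and find the sample size at which it first matches the LLM's error. The sketch below is not the authors' estimator. The helper name `equivalent_sample_size`, the interpolation in log sample size, and the toy numbers are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def equivalent_sample_size(
    sample_sizes: Sequence[int],
    model_errors: Sequence[float],
    llm_error: float,
) -> Optional[float]:
    """Smallest training-set size at which the conventional model's error
    reaches llm_error, interpolating linearly in log(n) between grid points.

    Hypothetical helper, not taken from the paper. Assumes sample_sizes is
    strictly increasing. Returns None if the model never matches the LLM.
    """
    if len(sample_sizes) != len(model_errors):
        raise ValueError("sample_sizes and model_errors must have equal length")
    prev_n: Optional[int] = None
    prev_err: Optional[float] = None
    for n, err in zip(sample_sizes, model_errors):
        if err <= llm_error:
            if prev_n is None:
                # The model already matches the LLM at the smallest grid point.
                return float(n)
            # Here prev_err > llm_error >= err, so the denominator is positive.
            frac = (prev_err - llm_error) / (prev_err - err)
            log_n = math.log(prev_n) + frac * (math.log(n) - math.log(prev_n))
            return math.exp(log_n)
        prev_n, prev_err = n, err
    return None


# Toy learning curve (made-up numbers): error falls as human data grows.
sizes = [100, 1000, 10000]
errors = [0.50, 0.40, 0.30]
n_equiv = equivalent_sample_size(sizes, errors, llm_error=0.35)
print(f"LLM is worth roughly {n_equiv:.0f} human observations")
```

With these toy numbers the LLM's error falls halfway between the errors at 1,000 and 10,000 observations, so the interpolated answer lies between those two sample sizes.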

large language model, machine learning, natural language, (20 more...)

arXiv.org Machine Learning

2601.12343

Country: North America > United States (1.00)

Genre: Research Report > Experimental Study (0.93)

Industry:

Health & Medicine (0.93)
Government > Regional Government > North America Government > United States Government (0.92)
Banking & Finance > Economy (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)

Evaluating Differentially Private Generation of Domain-Specific Text

Sun, Yidan, Schlegel, Viktor, Nandakumar, Srinivasan, Zahid, Iqra, Wu, Yuping, Del-Pinto, Warren, Nenadic, Goran, Lam, Siew-Kei, Zhang, Jie, Bharath, Anil A

arXiv.org Artificial Intelligence, Sep-1-2025

Generative AI offers transformative potential for high-stakes domains such as healthcare and finance, yet privacy and regulatory barriers hinder the use of real-world data. To address this, differentially private synthetic data generation has emerged as a promising alternative. In this work, we introduce a unified benchmark to systematically evaluate the utility and fidelity of text datasets generated under formal Differential Privacy (DP) guarantees. Our benchmark addresses key challenges in domain-specific benchmarking, including choice of representative data and realistic privacy budgets, accounting for pre-training and a variety of evaluation metrics. We assess state-of-the-art privacy-preserving generation methods across five domain-specific datasets, revealing significant utility and fidelity degradation compared to real data, especially under strict privacy constraints. These findings underscore the limitations of current approaches, outline the need for advanced privacy-preserving data sharing methods and set a precedent regarding their evaluation in realistic scenarios.

data mining, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.20452

Country: North America > United States (0.29)

Genre:

Research Report > Promising Solution (0.46)
Research Report > New Finding (0.34)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting

Ding, Fei, Wang, Baiqiao

arXiv.org Artificial Intelligence, Jul-1-2025

Supervised Fine-Tuning (SFT) is a critical step for enhancing the instruction-following capabilities of Large Language Models (LLMs) and adapting them to specialized domains. However, SFT often leads to a degradation of the model's general abilities, a phenomenon known as catastrophic forgetting. This problem is exacerbated when third-party practitioners fine-tune open-source models, as the original SFT data is typically not available. To address this challenge, we propose a novel and cost-effective SFT method that effectively mitigates catastrophic forgetting without requiring access to the original SFT data. Our approach first reconstructs the likely instruction distribution of the base model. It then employs a multi-model generation and filtering pipeline to synthesize a high-quality general-purpose dataset. This synthetic dataset is mixed with new, domain-specific data for fine-tuning. Experimental results show that our method not only preserves the model's capabilities in general domains but also improves task-specific performance, outperforming baselines that use publicly available SFT datasets.
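
The final step of the pipeline above, mixing a synthetic general-purpose set with new domain-specific data, could look roughly like the sketch below. The abstract does not state a mixing ratio or sampling scheme, so `general_fraction`, the helper name `mix_sft_data`, and the record layout are assumptions for illustration only.

```python
import random
from typing import Dict, List

Example = Dict[str, str]  # assumed record layout: {"instruction": ..., "response": ...}


def mix_sft_data(
    domain: List[Example],
    general: List[Example],
    general_fraction: float,
    seed: int = 0,
) -> List[Example]:
    """Keep every domain example and add enough general-purpose examples
    that they make up about general_fraction of the final mix.

    Hypothetical helper: the ratio is a free parameter, not a value
    reported in the abstract.
    """
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_domain + n_general) = general_fraction for n_general.
    n_general = round(len(domain) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general))  # cannot sample more than exist
    mixed = list(domain) + rng.sample(general, n_general)
    rng.shuffle(mixed)
    return mixed


domain_data = [{"instruction": f"domain q{i}", "response": "..."} for i in range(30)]
general_data = [{"instruction": f"general q{i}", "response": "..."} for i in range(100)]
mixed = mix_sft_data(domain_data, general_data, general_fraction=0.25)
print(len(mixed))
```

Keeping all domain examples and subsampling only the general set reflects the usual situation where domain data is the scarce resource; other schemes, such as upsampling the domain data, are equally plausible.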

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.09428

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

TrustDataFilter: Leveraging Trusted Knowledge Base Data for More Effective Filtering of Unknown Information

Zhang, Jinghong, Cui, Yidong, Wang, Weiling, Cheng, Xianyou

arXiv.org Artificial Intelligence, Jan-24-2025

With the advancement of technology and changes in the market, the demand for the construction of domain-specific knowledge bases has been increasing, either to improve model performance or to promote enterprise innovation and competitiveness. The construction of domain-specific knowledge bases typically relies on web crawlers or existing industry databases, leading to problems with accuracy and consistency of the data. To address these challenges, we considered the characteristics of domain data, where internal knowledge is interconnected, and proposed the Self-Natural Language Inference Data Filtering (self-nli-TDF) framework. This framework compares trusted filtered knowledge with the data to be filtered, deducing the reasoning relationship between them, thus improving filtering performance. The framework uses plug-and-play large language models for trustworthiness assessment and employs the RoBERTa-MNLI model from the NLI domain for reasoning. We constructed three datasets in the domains of biology, radiation, and science, and conducted experiments using RoBERTa, GPT3.5, and the local Qwen2 model. The experimental results show that this framework improves filter quality, producing more consistent and reliable filtering results.
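
The core idea above, checking each unfiltered statement against already-trusted knowledge with an NLI model, can be sketched as follows. This is an illustration under stated assumptions, not the self-nli-TDF implementation: the decision rule (drop a candidate if any trusted statement contradicts it) is assumed, and `toy_nli` is a deliberately crude stand-in for a real NLI model such as RoBERTa-MNLI.

```python
from typing import Callable, Iterable, List

# An NLI function maps (premise, hypothesis) to one of
# "entailment", "neutral", or "contradiction".
NliFn = Callable[[str, str], str]


def filter_candidates(
    candidates: Iterable[str], trusted: List[str], nli: NliFn
) -> List[str]:
    """Keep candidates that no trusted statement contradicts.

    Hypothetical decision rule chosen for illustration.
    """
    kept = []
    for candidate in candidates:
        labels = [nli(fact, candidate) for fact in trusted]
        if "contradiction" not in labels:
            kept.append(candidate)
    return kept


def toy_nli(premise: str, hypothesis: str) -> str:
    """Stand-in for a real NLI model: reports a contradiction only when the
    two sentences are identical except that exactly one contains 'not'."""
    p = premise.lower().split()
    h = hypothesis.lower().split()
    same_without_not = [w for w in p if w != "not"] == [w for w in h if w != "not"]
    if same_without_not and ("not" in p) != ("not" in h):
        return "contradiction"
    if p == h:
        return "entailment"
    return "neutral"


trusted_facts = ["the mitochondrion is the site of respiration"]
unfiltered = [
    "the mitochondrion is not the site of respiration",
    "ribosomes synthesize proteins",
]
print(filter_candidates(unfiltered, trusted_facts, toy_nli))
```

In practice `toy_nli` would be replaced by a call to an actual NLI classifier; passing it in as a function keeps the filtering logic independent of that choice.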

knowledge management, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2502.15714

Country:

North America > United States (0.28)
Asia > China > Beijing > Beijing (0.04)
Europe > Greece (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Knowledge Management > Knowledge Engineering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Matching domain experts by training from scratch on domain knowledge

Luo, Xiaoliang, Sun, Guangzhi, Love, Bradley C.

arXiv.org Artificial Intelligence, Jul-2-2024

Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model, using next-word prediction, on 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than LLMs trained on trillions of tokens, small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded when they were trained from scratch using a tokenizer specifically trained on neuroscience text or when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.

brainbench, neuroscience literature, training data, (16 more...)

arXiv.org Artificial Intelligence

2405.09395

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.78)

EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data

Ma, Shirong, Huang, Shen, Huang, Shulin, Wang, Xiaobin, Li, Yangning, Zheng, Hai-Tao, Xie, Pengjun, Huang, Fei, Jiang, Yong

arXiv.org Artificial Intelligence, Dec-25-2023

Large Language Models (LLMs) pre-trained on massive corpora have exhibited remarkable performance on various NLP tasks. However, applying these models to specific domains still poses significant challenges, such as lack of domain knowledge, limited capacity to leverage domain knowledge and inadequate adaptation to domain-specific data formats. Considering the exorbitant cost of training LLMs from scratch and the scarcity of annotated data within particular domains, in this work, we focus on domain-specific continual pre-training of LLMs using E-commerce domain as an exemplar. Specifically, we explore the impact of continual pre-training on LLMs employing unlabeled general and E-commercial corpora. Furthermore, we design a mixing strategy among different data sources to better leverage E-commercial semi-structured data. We construct multiple tasks to assess LLMs' few-shot In-context Learning ability and their zero-shot performance after instruction tuning in E-commerce domain. Experimental results demonstrate the effectiveness of continual pre-training of E-commerce LLMs and the efficacy of our devised data mixing strategy.

computational linguistic, language model, llm, (15 more...)

arXiv.org Artificial Intelligence

2312.15696

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(6 more...)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Services > e-Commerce Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Sumil Khosla on LinkedIn: FinancialAdvisor.AI

#artificialintelligence, Apr-14-2023, 07:12:57 GMT

Is it worth training your own large language model (LLM) from scratch on domain-specific data? Researchers at Bloomberg did just that and shared a detailed technical report describing the dataset, model configuration, and training procedure. In my experience, it makes total sense if we want to apply LLMs to novel data sources (e.g., protein amino acid sequences, as ProtBERT demonstrated). BloombergGPT is a 50-billion-parameter language model for finance, trained on 363 billion tokens of finance data and 345 billion tokens from a general dataset.
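
To put the quoted BloombergGPT figures in proportion, the short calculation below derives the total token count, the finance share of the training mix, and the tokens-per-parameter ratio. It uses only the three numbers stated in the post; the variable names are illustrative.

```python
# Figures as quoted in the post above.
finance_tokens = 363_000_000_000
general_tokens = 345_000_000_000
parameters = 50_000_000_000

total_tokens = finance_tokens + general_tokens
finance_share = finance_tokens / total_tokens
tokens_per_parameter = total_tokens / parameters

print(f"total tokens:         {total_tokens / 1e9:.0f} billion")
print(f"finance share of mix: {finance_share:.1%}")
print(f"tokens per parameter: {tokens_per_parameter:.2f}")
```

The mix is therefore close to an even split between domain and general data rather than a purely finance-trained model.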

domain-specific data, financialadvisor, sumil khosla, (6 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.42)

Context Matters: A Strategy to Pre-train Language Model for Science Education

Liu, Zhengliang, He, Xinyu, Liu, Lei, Liu, Tianming, Zhai, Xiaoming

arXiv.org Artificial Intelligence, Jan-27-2023

This study aims to improve the automatic scoring of student responses in science education. BERT-based language models have shown significant superiority over traditional NLP models in various language-related tasks. However, science writing of students, including argumentation and explanation, is domain-specific. In addition, the language used by students is different from the language in journals and Wikipedia, which are training sources of BERT and its existing variants. All these suggest that a domain-specific model pre-trained using science education data may improve model performance. However, the ideal type of data to contextualize a pre-trained language model and improve performance in automatically scoring students' written responses remains unclear. Therefore, we employ different data in this study to contextualize both BERT and SciBERT models and compare their performance on automatic scoring of assessment tasks for scientific argumentation. We use three datasets to pre-train the model: 1) journal articles in science education, 2) a large dataset of students' written responses (sample size over 50,000), and 3) a small dataset of students' written responses to scientific argumentation tasks. Our experimental results show that in-domain training corpora constructed from science questions and responses improve language model performance on a wide variety of downstream tasks. Our study confirms the effectiveness of continual pre-training on domain-specific data in the education domain and demonstrates a generalizable strategy for automating science education tasks with high accuracy. We plan to release our data and SciEdBERT models for public use and community engagement.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-36336-8_103

2301.12031

Country:

North America > United States > Georgia > Clarke County > Athens (0.14)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report > New Finding (0.54)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Prompt Tuning GPT-2 language model for parameter-efficient domain adaptation of ASR systems

Dingliwal, Saket, Shenoy, Ashish, Bodapati, Sravan, Gandhe, Ankur, Gadde, Ravi Teja, Kirchhoff, Katrin

arXiv.org Artificial Intelligence, Jul-21-2022

Automatic Speech Recognition (ASR) systems have found their use in numerous industrial applications in very diverse domains, creating a need to adapt to new domains with small memory and deployment overhead. In this work, we introduce domain-prompts, a methodology that involves training a small number of domain embedding parameters to prime a Transformer-based Language Model (LM) to a particular domain. Using this domain-adapted LM for rescoring ASR hypotheses can achieve 7-13% WER reduction for a new domain with just 1000 unlabeled textual domain-specific sentences. This improvement is comparable to or even better than that of fully fine-tuned models, even though just 0.02% of the parameters of the base LM are updated. Additionally, our method is deployment-friendly as the learnt domain embeddings are prefixed to the input to the model rather than changing the base model architecture. Therefore, our method is an ideal choice for on-the-fly adaptation of LMs used in ASR systems to progressively scale them to new domains.
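
For readers unfamiliar with the metric quoted above, word error rate (WER) is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below computes WER and a relative reduction between two systems. The abstract does not say whether its 7-13% figure is relative or absolute; treating it as relative is an assumption here, and the function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which the adapted system lowers the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
print(f"relative reduction: {relative_wer_reduction(0.20, 0.18):.0%}")
```

Under the relative reading, a baseline WER of 20% improved to 18% counts as a 10% reduction, which would sit inside the reported 7-13% range.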

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2112.08718

Country:

Europe > Spain > Galicia > Madrid (0.04)
Asia > Middle East > UAE > Dubai Emirate > Dubai (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.87)

10 NLP Predictions for 2022

#artificialintelligence, Jan-13-2022, 08:50:13 GMT

Natural language processing (NLP) has been one of the hottest sectors in AI over the past two years. Will the string of big data breakthroughs continue into 2022? We checked in with industry experts to find out. There's been a veritable arms race to develop large transformer models over the past couple of years. It started in 2020 with OpenAI's GPT-3 with 175 billion parameters.

language model, prediction, provider, (17 more...)

#artificialintelligence

Industry: Information Technology (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.90)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.49)
